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Abstract 

Modern parallel microprocessors deliver high performance on ap- 
plications that expose substantial fine-grained data parallelism. Al- 
though data parallelism is widely available in many computations, 
implementing data parallel algorithms in low-level languages is 
often an unnecessarily difficult task. The characteristics of paral- 
lel microprocessors and the limitations of current programming 
methodologies motivate our design of Copperhead, a high-level 
data parallel language embedded in Python. The Copperhead pro- 
grammer describes parallel computations via composition of fa- 
miliar data parallel primitives supporting both flat and nested data 
parallel computation on arrays of data. Copperhead programs are 
expressed in a subset of the widely used Python programming lan- 
guage and interoperate with standard Python modules, including 
libraries for numeric computation, data visualization, and analysis. 

In this paper, we discuss the language, compiler, and runtime 
features that enable Copperhead to efficiently execute data paral- 
lel code. We define the restricted subset of Python which Copper- 
head supports and introduce the program analysis techniques nec- 
essary for compiling Copperhead code into efficient low-level im- 
plementations. We also outline the runtime support by which Cop- 
perhead programs interoperate with standard Python modules. We 
demonstrate the effectiveness of our techniques with several ex- 
amples targeting the CUDA platform for parallel programming on 
GPUs. Copperhead code is concise, on average requiring 3.6 times 
fewer lines of code than CUDA, and the compiler generates effi- 
cient code, yielding 45-100% of the performance of hand-crafted, 
well optimized CUDA code. 

1. Introduction 

As the transition from sequential to parallel processing continues, 
the need intensifies for high-productivity tools and methodologies 
for parallel programming. Manual implementation of parallel pro- 
grams using efficiency languages such as C and C++ can yield 
high performance, but at a large cost in programmer productivity, 
which some case studies show as being 2 to 5 times worse than 
productivity-oriented languages such as Ruby and Python [16, 26]. 
Additionally, parallel programming in efficiency languages is of- 
ten viewed as an esoteric endeavor, to be undertaken by experts 
only. Without higher-level abstractions that enable easier parallel 
programming, parallel processing will not be widely utilized. Con- 
versely, today's productivity programmers do not have the means to 
capitalize on fine-grained, highly parallel microprocessors, which 
limits their ability to implement computationally intensive applica- 
tions. As we shall see, the flexibility provided by higher level data 
parallel abstractions can serve as a productive substrate for parallel 
programming. 

Although the need for higher-level parallel abstractions seems 
clear, perhaps less so is the type of abstractions which should be 
provided, since there are many potential abstractions that could be 
presented to productivity programmers. In our view, nested data 
parallelism, as introduced by languages such as NESL [4], is par- 



ticularly interesting. Nested data parallelism is abundant in many 
computationally intensive algorithms. It can be mapped efficiently 
to parallel microprocessors, which prominently feature hardware 
support for data parallelism. For example, mainstream x86 proces- 
sors from Intel and AMD are adopting 8-wide vector instructions, 
Intel's Larrabee processor used 16-wide vector instructions, and 
modern GPUs from NVIDIA and AMD use wider SIMD widths of 
32 and 64, respectively. Consequently, programs which don't take 
advantage of data parallelism relinquish substantial performance. 

Additionally, nested data parallelism as an abstraction clearly 
exposes parallelism. In contrast to traditional auto-parallelizing 
compilers, which must analyze and prove which operations can 
be parallelized and which can not, the compiler of a data-parallel 
language needs only to decide which parallel operations should be 
performed sequentially, and which should be performed in parallel. 
Accordingly, nested data parallel programs have a valid sequential 
interpretation, and are thus easier to understand and debug than par- 
allel programs that expose race conditions and other complications 
of parallelism. 

Motivated by these observations, Copperhead provides a set 
of nested data parallel abstractions, expressed in a restricted sub- 
set of the widely used Python programming language. Instead of 
creating an entirely new programming language for nested data 
parallelism, we repurpose existing constructs in Python, such as 
map and reduce. Embedding Copperhead in Python provides sev- 
eral important benefits. For those who are already familiar with 
Python, learning Copperhead is more similar to learning how to use 
a Python package than it is to learning a new language. There is no 
need to learn any new syntax, instead the programmer must learn 
only what subset of Python is supported by the Copperhead lan- 
guage and runtime. The Copperhead runtime is implemented as a 
standard Python extension, and Copperhead programs are invoked 
through the Python interpreter, allowing them to interoperate with 
the wide variety of Python libraries already extant for numeric com- 
putation, file manipulation, data visualization, and analysis. This 
makes Copperhead a productive environment for prototyping, de- 
bugging, and implementing entire applications, not just their com- 
putationally intense kernels. 

Of course, without being able to efficiently compile nested data 
parallel programs, Copperhead would not be of much use. Previous 
work on compilers for nested data parallel programming languages 
has focused on flattening transforms to target flat arrays of pro- 
cessing units. However, modern parallel microprocessors support a 
hierarchy of parallelism, with independent cores containing tightly 
coupled processing elements. Accordingly, the Copperhead com- 
piler maps nested data parallel programs to a nested parallelism 
hierarchy, forgoing the use of flattening transforms, which we find 
to substantially improve performance. 

The current Copperhead compiler targets CUDA C++, running 
on manycore Graphics Processors from NVIDIA. We generate ef- 
ficient, scalable parallel programs, performing within 45-100% of 
well optimized, hand- tuned CUDA code. Our initial implementa- 
tion focus on CUDA does not limit us to a particular hardware 



platform, as the techniques we have developed for compiling data- 
parallel code are widely applicable to other platforms as well. In 
the future, we anticipate the development of Copperhead compiler 
backends for other parallel platforms. 

This paper highlights several aspects of Copperhead. First, we 
describe the language itself, including a modest number of lan- 
guage constructs that, in composition, are expressive enough to 
represent a broad range of data-parallel applications. Secondly, we 
describe the analysis and code transformations which the Copper- 
head compiler performs in order to produce efficient parallel im- 
plementations. Thirdly, we illustrate the features of the Copper- 
head runtime which enable embedding and executing Copperhead 
code from within Python programs. Finally, we present perfor- 
mance results from our compiler which show that high-level, nested 
data-parallel computations can be efficiently compiled onto parallel 
hardware. 

2. Related Work 

There is an extensive literature investigating many alternative meth- 
ods for parallel programming. Data parallel approaches [2, 3, 14] 
have often been used, and have historically been most closely as- 
sociated with SIMD and vector machines, such as the the CM- 
2 and Cray C90, respectively. Blelloch et al. designed the NESL 
language [4], demonstrating a whole-program transformation of 
nested data parallel programs into flat data parallel programs. The 
flattening transform has been extended to fully higher-order func- 
tional languages in Data Parallel Haskell [7]. In contrast to these 
methods, we attempt to schedule nested procedures directly onto 
the hierarchical structure of the machines we target. 

The CUDA platform [24, 25, 28] defines a blocked SPMD 
programming model for executing parallel computations on GPUs. 
The Thrust [15] and CUDPP [9] libraries provide a collection of 
flat data parallel primitives for use in CUDA programs. Copperhead 
uses selected Thrust primitives in the code it generates. 

Systems for compiling flat data parallel programs for GPU 
targets have been built in a number of languages, including C# [30], 
C++ [22, 23], and Haskell [20]. Such systems typically define 
special data parallel array types and use operator overloading and 
metaprogramming techniques to build expression trees describing 
the computation to be performed on these arrays. The Ct [12] 
library adopts a similar model for programming more traditional 
multicore processors. However, these systems have not provided 
a means to automatically map nested data parallel programs to a 
hierarchically nested parallel platform. 

Rather than providing data parallel libraries, others have ex- 
plored techniques that mark up sequential loops to be parallelized 
into CUDA kernels [13, 21, 32]. In these models, the programmer 
writes an explicitly sequential program consisting of loops that are 
parallelizable. The loop markups, written in the form of C pragma 
preprocessor directives, indicate to the compiler that loops can be 
parallelized into CUDA kernels. 

There have also been some attempts to compile various sub- 
sets of Python to different platforms. The Cython compiler [10] 
compiles a largely Python-like language into sequential C code. 
Clyther [8] takes a similar approach to generating OpenCL kernels. 
In both cases, the programmer writes a sequential Python program 
which is transliterated into sequential C. Rather than transliterating 
Python programs, PyCUDA [19] provides a metaprogramming fa- 
cility for textually generating CUDA kernel programs. This allows 
the program to parametrically unroll loops and vary grid dimen- 
sions, among other things. Theano [31] provides an expression tree 
facility for numerical expressions on numerical arrays. Garg and 
Amaral [11] recently described a technique for compiling Python 
loop structures and array operations into GPU-targeted code. These 
projects all provide useful ways of interacting with parallel hard- 



ware and generating efficiency language code, but the programmer 
still expresses parallel computations isomorphically to the equiv- 
alent efficiency language code. Consequently, many of the draw- 
backs of efficiency level parallel programming still apply, such as 
requiring the programmer to explicitly map every computation to a 
particular parallel platform. Copperhead aims to solve a fairly dif- 
ferent problem, namely compiling a program from a higher level of 
abstraction into efficiently executable code. 

3. Copperhead Language 

A Copperhead program is a Python program that imports the Cop- 
perhead language environment: 

from copperhead import * 

A Copperhead program is executed, like any other Python program, 
by executing the sequence of its top-level statements. Selected pro- 
cedure definitions within the body of the program may be marked 
with the Copperhead decorator, as in the following: 

@cu 

def add_vectors(x, y) : 

return map(lambda xi,yi: xi+yi, x, y) 

This @cu decorator declares the associated procedure to be a Cop- 
perhead procedure. These procedures must conform to the require- 
ments of the Copperhead language, and they may be compiled for 
and executed on any of the parallel platforms supported by the Cop- 
perhead compiler. Once defined, Copperhead procedures may be 
called just like any other Python procedure, both within the pro- 
gram body or, as shown below, from an interactive command line. 

»> add_vectors( range(lO) , [2]*10) 
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 

The @cu decorator interposes a wrapper object around the native 
Python function object that is responsible for compiling and exe- 
cuting procedures on a target parallel platform. 

Copperhead is fundamentally a data-parallel language. It pro- 
vides language constructs, specifically map, and a library of prim- 
itives such as reduce, gather, and scatter that all have intrin- 
sically parallel semantics. They operate on 1 -dimensional arrays of 
data that we refer to as sequences. 

3.1 Restricted Subset of Python 

The Copperhead language is a restricted subset of the Python 2.6 
language [27]. Every valid Copperhead procedure must also be a 
valid Python procedure. Portions of the program outside procedures 
marked with the @cu decorator are unaffected by these restrictions, 
as they are normal Python programs that will be executed by the 
standard Python interpreter. The Copperhead language supports a 
very restricted subset of Python in order to enable efficient compi- 
lation. 

Copperhead adopts the lexical and grammatical rules of Python. 
In summarizing its restricted language, we denote expressions by 
E and statements by S. We use lower-case letters to indicate iden- 
tifiers and F and A to indicate function-valued and array-valued 
expressions, respectively. 

3.1.1 Expressions 

The most basic expressions are literal values, identifier references, 
tuple constructions, and accesses to array elements. 

E : x | [E u E n ) | A[E] 

| True | False | integer | floatnumber 

The basic logical and arithmetic operators are allowed, as are 
Python's and, or, and conditional expressions. 



I Ex+E 2 | E x < E 2 | not E \ ... 

| E x and E 2 \ Ex or E 2 \ Ex if E p else E 2 

Expressions that call and define functions must use only positional 
arguments. Optional and keyword arguments are not allowed. 



F(Ex 



) I lambda xx, 



Copperhead relies heavily on map as the fundamental source of 
parallelism and elevates it from a built-in function (as in Python) to 
a special form in the grammar. Copperhead also supports a limited 
form of Python's list comprehension syntax, which is de-sugared 
into equivalent map constructions during parsing. 



map(F, Ax, . . . , A n 
[E for x in A] 
[E for xx, . . . , x r 



) 

in zip(Ai, 



A n )] 



In addition to these grammatical restrictions, Copperhead Cop- 
perhead expressions must be statically well-typed. We employ a 
standard Hindley-Milner style polymorphic type system. Static typ- 
ing both provides richer information to the compiler that it can 
leverage during code generation and avoids the run-time overhead 
of dynamic type dispatch. 

3.1.2 Statements 

The body of a Copperhead procedure consists of a suite of state- 
ments: one or more statements S which are nested by indentation 
level. Each statement S of a suite must be of the following form: 



return E 

| Xx 5 • • • , Xn 



E 



if E\ suite else: 
def f{xx, . . . , x n ) : 



suite 
suite 



Copperhead further requires that every execution path within a 
suite must return a value, and all returned values must be of the 
same type. This restriction is necessary since every Copperhead 
procedure must have a well-defined static type. 

All data in a Copperhead procedure are considered immutable 
values. Thus, statements of the form x = E bind the value of the 
expression E to the identifier x; they do not assign a new value to 
an existing variable. All identifiers are strictly lexically scoped. 

Copperhead does not guarantee any particular order of evalua- 
tion, other than the partial ordering imposed by data dependencies 
in the program. Python, in contrast, always evaluates statements 
from top to bottom and expressions from left to right. By defini- 
tion, a Copperhead program must be valid regardless of the order 
of evaluations, and thus Python's mandated ordering is one valid 
ordering of the program. 

Because mutable assignment is forbidden and the order of eval- 
uation is undefined, the Copperhead compiler is free to reorder and 
transform procedures in a number of ways. As we will see, it uses 
this flexibility for a variety of purposes to improve the efficiency of 
generated code. 

3.2 Data-Parallel Primitives 

Copperhead is a data-parallel language. Programs manipulate data 
sequences by applying aggregate operations, such as map, or 
reduce. The semantics of these primitives are implicitly paral- 
lel: they may always be performed by some parallel computation, 
but may also be performed sequentially. 

Table 1 summarizes the main aggregate operations used by the 
examples in this paper. They mirror operations found in most other 
data-parallel languages. With two minor exceptions, these func- 
tions will produce the same result regardless of whether they are 
performed in parallel or sequentially. The pe rmute primitive is the 
primary exception. If its given indices array contains one or more 



@cu 

def spmv_csr(A_values, A_columns, x) : 
def spvv(Ai, j ) : 

z = qather (x, j ) 

return sum (map_( lambda Aij , xj : Aij*xj, Ai, z)) 

return map_(spvv, A_values, A_columns) 

Figure 1. Procedure for computing Ax for a matrix A in CSR 
form and a dense vector x. Underlined operations indicate potential 
sources of parallel execution. 



repeated index values, the resulting sequence is non-deterministic 
since we guarantee no particular order of evaluation. The other ex- 
ception is reduce. If given a truly commutative and associative 
operation, then its result will always be identical. However, pro- 
grams often perform reductions using floating point values whose 
addition, for instance, is not truly associative. 

To demonstrate our use of these operators, Figure 1 shows a 
simple Copperhead procedure for computing the sparse matrix- 
vector product (SpMV) y — Ax. Here we assume that A is 
stored in Compressed Sparse Row (CSR) format — one of the most 
frequently used representation for sparse matrices — and that x is 
a dense vector. The matrix representation simply records each row 
of the matrix as a sequence containing its non-zero values along 
with a corresponding sequence recording the column index of each 
value. A simple example of this representation is: 



1 7 0 0- 

0 2 8 0 

5 0 3 9 

0 6 0 4 



vals = [[1,7], [2,8], [5,3,9], [6,4]] 
cols = [[0,1], [1,2], [0,2,3], [1,3]] 



The body of spmv_cs r applies a sparse dot product procedure, 
spvv, to each row of the matrix using map. The sparse dot product 
itself produces the result yi for row i by forming the products AijXj 
for each column j containing a non-zero entry, and then summing 
these products together. It uses gather to fetch the necessary 
values Xj from the dense vector x and uses sum, a convenient 
special case of reduce where the operator is addition, to produce 
the result. 

This simple example illustrates two important issues that are a 
central concern for our Copperhead compiler. First, it consists of a 
number of potentially parallel aggregate operations, which are un- 
derlined. Second, these aggregate operators are nested: those within 
spvv are executed within an enclosing map operation invoked by 
spmv_csr. One of the central tasks of our compiler is to decide 
how to schedule these operations for execution. Each of them may 
be performed in parallel or by sequential loops, and the nesting of 
operations must be mapped onto the hardware execution units in a 
suitable fashion. 

4. Compiler 

The Copperhead compiler is a source-to- source compiler responsi- 
ble for converting a suite of one or more Copperhead procedures 
into a code module that can execute on the target parallel plat- 
form. We have designed it to support three basic usage patterns. 
First is what we refer to as JIT (Just-In-Time) compilation. When 
the programmer invokes a @cu-decorated function either from the 
command line or from a Python code module, the Copperhead run- 
time may need to generate code for this procedure if none is already 
available. Second is batch compilation where the Copperhead com- 
piler is asked to generate a set of C++ code modules for the speci- 
fied procedures. This code may be subsequently used either within 
the Copperhead Python environment or linked directly with exter- 



y = replicate(a, n) 

y = map(f , xl, . . . , xn) 

y = zip(xl, x2) 

y = gather(x, indices) 

y = permute(x, indices) 

s = reduce(©, x, prefix) 

y = scan(f, x) 



Return an n -element sequence whose every element has value a. 

Returns sequence [f(xl[0] , xn[0]), f(xl[l], . .., xn[l]), ...]. 

Return sequence of pairs [ (xl[0] ,x2[0] ) , (xl [ 1] , x2 [ 1] ) , ...]. 

Produce sequence with y [ i ] = x [ indices [i] ] . 

Produce y where y[ indices [i] ] = x[i]. 

Computes prefix 0 x[0] 0 x[l] 0 ... for a commutative and associative operator 0. 
Produce y such that y [ 0 ] =x [ 0 ] and y[i] = f ( y [ i - 1 ] , x[i]). Function f must be associative. 



Table 1. Selected data-parallel operations on sequences. 



# lambdaO :: (a, a) -+ a 
def lambda0(Aij , xj ) : 

return op_mul (Aij , xj ) 

# spvv ;; ([a], [Int], [a]) -+ 
def spvv(Ai, j , _k0) : 

zo = gather(_k0, j ) 

tmpo = map(lambda0, Ai, z 0 ) 

return sum(tmp 0 ) 



[a] 



[a] 



# spmv_csr :: ([[a]], [[Int]], [a]) 
def spmv_csr(A_values, A_columns, x) : 

return map(closure( [x] , spvv), A_values, 



A_columns) 



Figure 2. SpMV procedure from Figure 1 after transformation by 
the front end compiler. 



nal C++ applications. The third common scenario is one where the 
compiler is asked to generate a collection of variant instantiations 
of a Copperhead procedure in tandem with an autotuning frame- 
work for exploring the performance landscape of a particular ar- 
chitecture. 

In the following discussion, we assume that the compiler is 
given a single top level Copperhead function — referred to as the 
"entry point" — to compile. It may, for instance, be a function like 
spmv_cs r that has been invoked at the Python interpreter prompt. 
For the CUDA platform, the compiler will take this procedure, 
along with any procedures it invokes, and produce a single sequen- 
tial host procedure and one or more parallel kernels that will be 
invoked by the host procedure. 

4.1 Front End 

After parsing the program text using the standard Python ast mod- 
ule, the compiler front end converts the program into a simplified 
abstract syntax tree (AST) form suitable for consumption by the 
rest of the compiler. It applies a sequence of fairly standard pro- 
gram transformations to: lift all nested lambdas and procedures 
into top level definitions, convert all statements to single assign- 
ment form with no nested expressions, and infer static types for all 
elements of the program. Programs that contain syntax errors or 
that do not conform to the restricted language outlined in Section 3 
are rejected. 

For illustration, Figure 2 shows a program fragment represent- 
ing the AST for the Copperhead procedure shown in Figure 1 . Each 
procedure is annotated with its (potentially polymorphic) most gen- 
eral type, which we determine using a standard Hindley-Milner 
style type inference process. Lexically captured variables within 
nested procedures are eliminated by converting free variables to ex- 
plicit arguments — shown as variables _k0, _kl, etc. — and replac- 
ing the original nested definition point of the function with a special 
closu re ( [ ki , . . . , k n ] , f ) form where the ki represent the 
originally free variables. In our example, the local variable x was 
free in the body of the nested spvv procedure, and is thus con- 
verted into an explicit closure argument in the transformed code. 
Single assignment conversion adds a unique subscript to each lo- 
cal identifier (e.g., Zo in spvv), and flattening nested expressions 



introduces temporary identifiers (e.g., tmpo) bound to the value of 
individual sub-expressions. Note that our single assignment form 
has no need of the ^-functions used in SSA representations of im- 
perative languages since we disallow loops and every branch of a 
conditional must return a value. 

4.2 Scheduling Nested Parallelism 

The front end of the compiler carries out platform independent 
transformations in order to prepare the code for scheduling. The 
middle section of the compiler is tasked with performing analyses 
and scheduling the program onto a target platform. 

At this point in the compiler, a Copperhead program consists 
of possibly nested compositions of data parallel primitives. A Cop- 
perhead procedure may perform purely sequential computations, 
in which case our compiler will convert it into sequential C++ 
code. Copperhead makes no attempt to auto-parallelize sequential 
codes. Instead, it requires the programmer to use primitives that 
the compiler will auto-sequentialize as necessary. Our compila- 
tion of sequential code is quite straightforward, since we rely on 
the host C++ compiler to handle all scalar code optimizations, and 
our restricted language avoids all the complexities of compiling a 
broad subset of Python that must be addressed by compilers like 
Cython[10]. 

Copperhead supports both flat and nested data parallelism. A 
flat Copperhead program consists of a sequence of parallel primi- 
tives which perform purely sequential operations to each element 
of a sequence in parallel. A nested program, in contrast, may apply 
parallel operations to each sequence element. Our spmv_cs r code 
provides a simple concrete example. The outer spmv_csr proce- 
dure applies spvv to every row of its input via map. The spvv 
procedure itself calls gather, map, and sum, all of which are po- 
tentially parallel. The Copperhead compiler must decide how to 
map these potentially parallel operations onto the target hardware. 
This mapping depends on the composition of the program - the ex- 
act same code fragment may end up being implemented in very 
different ways, depending on the context in which it is instantiated. 

One approach to scheduling nested parallel programs, adopted 
by NESL [4] and Data Parallel Haskell [7] among others, is to 
apply a flattening (or vectorization) transform to the program. This 
converts a nested structure of vector operations into a sequence 
of flat operations. In most cases, the process of flattening replaces 
the nested operations with segmented equivalents. For instance, in 
our spmv_csr example, the nested sum would be flattened into a 
segmented reduction. 

The flattening transform is a powerful technique which ensures 
good load balancing. It also enables data parallel compilation for 
flat SIMD processors, where all lanes in the SIMD array must per- 
form the same operation. However, we take a different approach. 
Flattening transformations are best suited to machines that are truly 
flat. Most modern machines, in contrast, are organized hierarchi- 
cally. The CUDA programming model [24, 25], for instance, pro- 
vides a hierarchy of execution formed into four levels: 

1 . A host thread running on the CPU, which can invoke 

2. Parallel kernels running on the GPU, which consist of 

3. A grid of thread blocks each of which is comprised of 



4. Potentially hundreds of device threads running on the GPU. 

The central goal of the Copperhead compiler is thus to map the 
nested structure of the program onto the hierarchical structure of 
the machine. 

Recent experience with hand-written CUDA programs sug- 
gests that direct mapping of nested constructions onto this physical 
machine hierarchy often yields better performance. For instance, 
Bell and Garland [1] explored several strategies for implement- 
ing SpMV in CUDA. Their Coordinate (COO) kernel is imple- 
mented by manually applying the flattening transformation to the 
spmv_cs r algorithm. Their other kernels represent static mappings 
of nested algorithms onto the hardware. The flattened COO ker- 
nel only delivers the highest performance in the exceptional case 
where the distribution of row lengths is extraordinarily variable, 
in which case the load balancing provided by the flattening trans- 
form is advantageous. However, for 13 out of the 14 unstructured 
matrices they examine, applying the flattening transform results in 
performance two to four times slower than the equivalent nested 
implementation. 

Experiences such as this lead us to the conclusion that although 
the flattening transform can provide high performance in excep- 
tional cases where the workload is extremely imbalanced, the de- 
cision to apply the transform should be under programmer control, 
given the substantial overhead the flattening transform imposes for 
most workloads. 

Our compiler thus performs a static mapping of nested programs 
onto a parallelism hierarchy supported by the target parallel plat- 
form. 

Returning to our spmv_csr example being compiled to the 
CUDA platform, the map within its body will become a CUDA ker- 
nel call. The compiler can then chose to map the operations within 
spvv to either individual threads or individual blocks. Which map- 
ping is preferred is in general program and data dependent; ma- 
trices with very short rows are generally best mapped to threads 
while those with longer rows are better mapped to blocks. Because 
this information is not in general known until run time, we cur- 
rently present this decision as an option that can be controlled when 
invoking the compiler. The programmer can arrange the code to 
specify explicitly which is preferred, an autotuning framework can 
explore which is better on a particular machine, or a reasonable de- 
fault can be chosen based on static knowledge (if any) about the 
problem being solved. In the absence of any stated preference, the 
default is to map nested operations to sequential implementations. 
This will give each row of the matrix to a single thread of the kernel 
created by the call to map in spmv_cs r. 

4.3 Synchronization Points and Fusion 

The other major scheduling decision that the compiler must make is 
the placement of synchronization points between aggregate opera- 
tors. We assume that our target parallel platform provides a SPMD- 
like execution model, and thus we assume that synchronization be- 
tween a pair of parallel operations implies the equivalent of a bar- 
rier synchronization across parallel threads. Similarly, a synchro- 
nization point between a pair of operators that have been marked as 
sequentialized implies that they must be implemented with separate 
loop bodies. 

The most conservative policy would be to require a synchro- 
nization point following every parallel primitive. However, this can 
introduce far more synchronization, temporary data storage, and 
memory bandwidth consumption than necessary to respect data de- 
pendencies in the program. Requiring a synchronization point af- 
ter every parallel primitive can reduce performance by an order of 
magnitude or more, compared with equivalent code that synchro- 
nizes only when necessary. In order to produce efficient code, our 
compiler must introduce as few synchronization points as possible, 



by fusing groups of parallel primitives into single kernels or loop 
bodies. 

In order to determine where synchronization points are neces- 
sary, we need to characterize the data dependencies between in- 
dividual elements of the sequences upon which the parallel prim- 
itives are operating. Consider an aggregate operator of the form 
x = f (yi , . . . , y n ) where x and the are all sequences. We 
ignore any scalar arguments or results since they will not affect our 
analysis. Since the elements of x may be computed concurrently, 
we need to know what values of y^ it requires. 

We call the value x [ i ] complete when its value has been de- 
termined and may be used as input to subsequent operations. The 
array x is complete when all its elements are complete. 

We categorize the bulk of our operators into one of two classes. 
A third class of operators, which we treat specially and ignore in 
this analysis, are sequence constructors such as replicate. 

The first class are those operators which perform induction 
over the domain of their inputs. This class is essentially composed 
of permute, scatter, and their variants. We cannot in general 
guarantee that any particular x[i] is complete until the entire 
operation is finished. In other words, either the array x is entirely 
complete or entirely incomplete. 

The second class are those operators which perform induction 
over the domain of their outputs. This includes almost all the rest 
of our operators, such as map, gather, and scan. Here we want to 
characterize the portion of each yj that must be complete in order 
to complete x [ i ] . We currently distinguish three broad cases: 

• None: x [ i ] does not depend on any element of yj 

• Local: completing x [ i ] requires that y 3 ; [ i ] be complete 

• Global: completing x [ i ] requires that yj be entirely complete 

With deeper semantic analysis, we could potentially uncover more 
information in the spectrum between local and global completion 
requirements. However, at the moment we restrict our attention to 
this simple classification. 

Each primitive provided by Copperhead declares its completion 
requirements on its arguments (none, local, global) and whether 
its result is completed globally (class 1) or locally (class 2). The 
compiler then propagates this information through the body of user- 
defined procedures to compute their completion requirements. A 
synchronization point is required after any primitive of class 1 
and before any primitive that requires one of its arguments to be 
globally complete. We call a sequence of primitives that are not 
separated by any synchronization points a phase of the program. 
All operations within a single phase may potentially be fused into 
a single parallel kernel or sequential loop by the compiler. 

4.4 Shape Analysis 

Shape analysis determines, where possible, the sizes of all interme- 
diate values. This allows the back end of the compiler to statically 
allocate and reuse space for temporary values, which is critical to 
obtaining efficient performance. For the CUDA backend, forgoing 
shape analysis would require that every parallel primitive be exe- 
cuted individually, since allocating memory from within a CUDA 
kernel is not feasible. This would lead to orders of magnitude lower 
performance. 

Our internal representation gives a unique name to every tem- 
porary value. We want to assign to each a shape of the form 
([di, d n ], s) where d; is the array's extent in dimension 

i and s is the shape of each of its elements. The shape of a 4- 
element, 1 -dimensional sequence [5,4,8,1] would, for example, 
be ([4] , Unit) where Unit=([] , •) is the shape reserved for 
indivisible types such as scalars and functions. Nested sequences 
are not required to have fully determined shapes: in the case where 
subsequences have differing lengths, the extent along that dimen- 



sion will be undefined, for example: ([2], ([*], Unit)). Note 
that the dimensionality of all values are known as a result of type 
inference. It remains only to determine extents, where possible. 

We approach shape analysis of user-provided code as an abstract 
interpretation problem. We define a shape language consisting of 
Unit, the shape constructor (D, s) described above, identifiers, 
and shape functions extentof (s), elementof (s) that extract 
the two respective portions of a shape s. We implement a simple 
environment-based evaluator where every identifier is mapped to 
either (1) a shape term or (2) itself if its shape is unknown. Ev- 
ery primitive f is required to provide a function, that we denote 
f . shape, that returns the shape of its result given the shapes of its 
inputs. It may also return a set of static constraints on the shapes 
of its inputs to aid the compiler in its analysis. Examples of such 
shape-computing functions would be: 

gather. shape(x, i) = (extentof (i) , elementof (x)) 
zip. shape(xl, x2) = (extentof (xl) , 

(elementof (xl) , elementof (x2) )) 
with extentof (xl)==extentof(x2) 

The gather rule states that its result has the same size as the index 
array i while having elements whose shape is given by the elements 
of x. The zip augments the shape of its result with a constraint that 
the extents of its inputs are assumed to be the same. Terms such as 
extentof (x) are left unevaluated if the identifier x is not bound 
to any shape. 

To give a sense of the shape analysis process, consider the spvv 
procedure shown in Figure 2. Shape analysis will annotate every 
binding with a shape like so: 

def spvv(Ai, j , _k0) : 

zo :: (extentof (j ) , elementof (_k0)) 
tmpo :: (extentof (Ai) , elementof (Ai)) 
return sum(tmpo) :: Unit 

In this case, the shapes for z 0 , tmp 0 , and the return value are 
derived directly from the shape rules for gather, map, and sum, 
respectively. 

Shape analysis, is not guaranteed to find all shapes in a pro- 
gram. Some identifiers may have data dependent shapes, making 
the analysis inconclusive. For these cases, the compiler must treat 
computations without defined shapes as barriers, across which no 
fusion can be performed, and then generate code which computes 
the actual data dependent shape before the remainder of the com- 
putation is allowed to proceed. 

Future work involves allowing shape analysis to influence code 
scheduling: in an attempt to better match the extents in a particu- 
lar nested data parallel problem to the dimensions of parallelism 
supported by the platform being targeted. For example, instead of 
implementing the outermost level of a Copperhead program in a 
parallel fashion, if the outer extent is small and the inner extent is 
large, the compiler may decide to create a code variant which se- 
quentializes the outermost level, and parallelizes an inner level. 

4.5 CUDA C++ Back End 

The back end of the Copperhead compiler generates platform spe- 
cific code. As mentioned, we currently have a single back end, 
which generates code for the CUDA platform. Figure 3 shows an 
example of such code for our example spmv_csr procedure. It 
consists of a sequence of function objects for the nested lambdas, 
closures, and procedures used within that procedure. The templated 
function spmv_csr_phaseO corresponds to the Copperhead entry 

point. The CUDA entry point, indicated by the global mod- 

ifer, provides a C-like interface suitable for calling from Python. 

Not shown here is the host code that invokes the parallel kernel. 
It is the responsibility of the host code to marshal data where 
necessary, allocate any required temporary storage on the GPU, 



struct lambdaO { 

template<typename _a > device 

_a operator! ) (_a Aij, _a xj ) { return Aij * xj ; } 

}; 

struct spvv { 

template<typename _a > device 

_a operator! ) (stored_sequence<_a> Ai, 
stored_sequence<int> j, 
stored_sequence<_a> _k0) { 
gathered<. . .> z0 = gather(_k0, j); 
t ransformed<. . .> tmpO = 

t ransform<_a>(lambdaO( ) , Ai, z0); 
return seq : : sum(tmpO) ; 

} 

}; 

template<typename T2> 
struct spvv_closurel { 
T2 k0; 

device spvv_closurel(T2 _k0) : k0(_k0) { } 

template<typename T0, typename Tl> device 

_a operator! ) (T0 arg0, Tl argl) { 
return spvv()(arg0, argl, k0); 

} 

}; 

template<typename _a > device 

void spmv_csr_phase0(stored_sequence<_a> x, 

nested_sequence<int , 1> A_columns, 
nested_sequence<_a, 1> A_values, 
stored_sequence<_a> _return) { 
int i = threadldx.x + blockIdx.x*blockDim.x; 

if( i < A_values . size( ) ) 

_return[i] = spvv_closurel<_a>(x) (A_values [i] , 

A_columns [i] ) ; 

} 

extern "C" global void spmv_csr_kernel0_int ( . . . ) { 

// (1) Wrap raw pointers from external code 
// into sequence structures. 



// (2) Invoke the templated entry point 
spmv_csr_phase0(x, A_columns, A_values, .return); 

} 



Figure 3. Sample CUDA C++ code generated for spmv_cs r. El- 
lipses (...) indicate incidental type and argument information elided 
for brevity. 

and make the necessary CUDA API calls to launch the kernel. 
Currently, this host code is generated and executed in Python for 
convenience. 

We generate templated CUDA C++ code that makes use of a set 
of sequence types, such as stored_sequence and nested_sequence 
types, which hold sequences in memory. Fusion of sequential loops 
and block- wise primitives is performed through the construction of 
compound types. For example, in the spvv functor shown above, 
the calls to gat he r, and t ransf o rm perform no work, instead they 
construct gathered_sequence and t ransf ormed_sequence 
structures that lazily perform the appropriate computations upon 
dereferencing. Work is only performed by the last primitive in 
a set of fused sequential or block- wise primitives. In this exam- 
ple, the call to seq : : sum introduces a sequential loop which then 
dereferences a compound sequence, at each element performing 
the appropriate computation. When compiling this code, the C++ 



compiler is able to statically eliminate all the indirection present in 
this code, yielding machine code which is as efficient as if we had 
generated the fused loops directly. 

We generate fairly high level C++ code, rather than assembly 
level code, for two reasons. Firstly, existing C++ compilers pro- 
vide excellent support for translating well structured C++ to effi- 
cient machine code. Emitting C++ from our compiler enables our 
compiler to utilize the vast array of transformations which exist- 
ing compilers already perform. But more importantly, it means that 
the code generated by our Copperhead compiler can be reused in 
external C++ programs. We believe that an important use case for 
systems like Copperhead is to prototype algorithms in a high-level 
language and then compile them into template libraries that can be 
used by a larger C++ application. 

5. Runtime 

Copperhead code is embedded in standard Python programs. 
Python function decorators indicate which procedures should be 
executed by the Copperhead runtime. When a Python program 
calls a Copperhead procedure, the Copperhead runtime intercepts 
the call, compiles the procedure, and then executes it on a specified 
execution place. The Copperhead runtime uses Python's introspec- 
tion capabilities to gather all the source code pertaining to the pro- 
cedure being compiled. This model is inspired by the ideas from 
Selective, Embedded, Just-In-Time Specialization [6]. 

It is important to note that runtime compilation is done for de- 
veloper convenience rather than programmatic flexibility. Dynamic 
compilation is sometimes used to generate code which is very spe- 
cialized to the particular instantiation of the procedure being com- 
piled, for example, treating inputs as constants, tracing execution 
through conditionals, etc. However, the Copperhead compiler and 
runtime do not perform such optimizations. The results of the Cop- 
perhead compiler can be encapsulated as a standard, statically com- 
piled binary, and cached for future reuse or incorporated as libraries 
into standalone programs which are not invoked through the Python 
interpreter. 

Copperhead's CUDA runtime generates a set of C++ functions 
and CUDA kernels which represent the different phases of the pro- 
cedure, as well as a master Python procedure which is responsible 
for allocating data and executing the CUDA kernels. In the near fu- 
ture, we plan to transition code generation for the master procedure 
from Python to C++, to avoid the runtime overhead of repeatedly 
entering and exiting the Python interpreter as the various phases of 
a Copperhead procedure are executed. 

Once a particular procedure is compiled, the results are cached, 
so that subsequent calls to the same function do not require re- 
compilation. Copperhead uses PyCUDA [19] and CodePy [18] to 
provide mechanisms for compiling, persistent caching, linking and 
executing CUDA and C++ code. The Copperhead runtime uses 
Thrust [15] to implement fully parallel versions of certain data par- 
allel primitives, such as reduce and variations of scan. 

5.1 Places 

In order to manage the location of data and kernel execution across 
multiple devices, the Copperhead runtime defines a set of places 
that represent these heterogeneous devices. Data objects are created 
at a specific place. Calling a Copperhead procedure will execute a 
computation on the current target place, which is controlled via the 
Python with statement. 

Currently we support two kinds of places: CUDA capable GPUs 
and the native Python interpreter. Copperhead is designed to allow 
other types of places, with corresponding compiler back ends to be 
added. For instance, multi-core x86 back end would be associated 
with a new place type. 



To faciliate interoperability between Python and Copperhead, 
all data is duplicated, with a local copy in the Python interpreter, 
and a remote copy which resides at the place of execution. Data is 
lazily transferred between the local and remote place as needed by 
the program. 

5.2 Data Structures 

Copperhead adopts a common technique fo representing arbitrarily 
nested sequences as a flat sequence of data, along with a descriptor 
sequence for each level of nesting. The descriptor sequences pro- 
vide the necessary information to build a view of each subsequence, 
including empty subsequences. 

In addition to supporting arbitrarily nested sequences, Copper- 
head also allows programmers to construct uniformly nested se- 
quences, which support the important special case where the shape 
is completely defined. For such sequences, a set of strides and 
lengths are sufficient to describe the nesting structure - descriptor 
sequences are not required. Uniformly nested sequences also allow 
the data to be arbitrarily ordered: when the programmer creates a 
uniformly nested sequence, they either specify the data ordering or 
provide a tuple of strides directly. This allows the programmer to 
express data layouts analogous to row- and column-major ordering 
for doubly nested sequences, but for arbitrarily nested sequences. 
The programmer may provide the strides directly, which allows the 
subsequences to be aligned arbitrarily. Allowing the programmer 
to construct uniformly nested sequences takes advantage of knowl- 
edge the programmer may have about the data being used in a Cop- 
perhead program, and can provide important performance benefits 
when data access patterns with standard nested sequences are not 
matched well to the processor's memory hierarchy. 

The performance differential between using a uniform nested 
sequence versus a general nested sequence can be large, we have 
seen performance improvements of 2-3 times by using the correct 
data structure. At present, the programmer is responsible for choos- 
ing the format of uniformly nested sequences, although future work 
may investigate autotuning over alignments and data layouts. 

6. Examples 

In this section, we investigate the performance of Copperhead code 
in several example programs. As we compare performance, we 
compare to published, well-optimized, hand-crafted CUDA imple- 
mentations of the same computation. Our compiler can't apply all 
the transformations which a human can, so we don't expect to 
achieve the same performance as well-optimized code. Still, we 
aim to show that high-level data parallel computations can perform 
within striking distance of human optimized efficiency code. 

6.1 Sparse Matrix Vector Multiplication 

Continuing with the Sparse Matrix Vector Multiplication exam- 
ple, we examine the performance of Copperhead generated code 
for three different SpMV kernels: compressed sparse row, vector 
compressed sparse row, and ELL. The CSR kernel is generated by 
compiling the Copperhead procedure for CSR SpMV onto the stan- 
dard parallelism hierarchy, which distributes computations along 
the outermost parallelism dimension to independent threads. Data 
parallel operations are sequentialized into loops inside each thread. 
For the CSR kernel, each row of the matrix is then processed by a 
different thread, and consequently, adjacent threads are processing 
widely separated data. On the CUDA platform, this yields subopti- 
mal performance. 

The vector CSR kernel improves on the performance of the CSR 
kernel by mapping to a different parallelism hierarchy: one where 
the outermost level distributes computations among independent 
thread blocks, and subsequent data parallel operations are then 



sequentialized to block- wise operations. As mentioned earlier, the 
Copperhead programmer can choose which hierarchy to map to, or 
can choose to autotune over hierarchies. In this case, mapping to the 
block- wise hierarchy improves memory performance on the CUDA 
platform, since adjacent threads inside the same thread block are 
processing adjacent data, allowing for vectorized memory accesses. 

The ELL representation stores the nonzeros of the matrix in a 
dense M-by-K array, where K bounds the number of nonzeros per 
row, and rows with fewer than K nonzeros are padded. The array 
is stored in column-major order, so that after the computation has 
been mapped to the platform, adjacent threads will be accessing 
adjacent elements of the array. For example: 



A -- 



"1 7 0 0- 

0 2 8 0 

5 0 3 9 

0 6 0 4 



vals = [[1,2,5,6], [7,8,3,4], [*, 
cols = [[0,1,0,1], [1,2,2,3], [*, 



,9,*]] 
,3,*]] 



ELL generally performs best on matrices where the variance of 
the number of non-zeros per row is small. If exceptional rows with 
large numbers of non-zeros exist, ELL will perform badly due to 
the overhead resulting from padding the smaller rows. 
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Figure 4. Average Double Precision Sparse Matrix Vector Multi- 
ply Performance 

We compare against Cusp [1], a C++ library for Sparse Matrix 
Vector multiplication, running on an NVIDIA GTX 480 GPU. We 
use a suite of 8 unstructured matrices which were used by Bell and 
Garland [1], and which are amenable to ELL storage. Copperhead 
generated code achieves identical performance for the scalar CSR 
kernel, and on average provides 45% and 79% of Cusp performance 
for the vector CSR and ELL kernels, respectively. 

Our relatively low performance on the vector CSR kernel oc- 
curs because of a specialized optimization which the Cusp vector 
CSR implementation takes advantage of, but the Copperhead com- 
piler does not: namely the ability to perform "warp-wise" reduc- 
tions without synchronization, packing multiple rows of the SpMV 
computation into a single thread block. This optimization is an im- 
portant workaround for the limitations of today's CUDA proces- 
sors, but we considered it too special purpose to implement in the 
Copperhead compiler. 

6.2 Preconditioned Conjugate Gradient Linear Solver 

The Conjugate Gradient method is widely used to solve sparse sys- 
tems of the form Ax — b. We examine performance on a precon- 
ditioned conjugate gradient solver written in Copperhead, which 
forms a part of an fixed-point non-linear solver used in Variational 
Optical Flow methods [29]. Conjugate gradient performance de- 
pends strongly on matrix-vector multiplication performance. Al- 
though we could have used a preexisting sparse-matrix library and 
representation, in this case we know some things about the structure 
of the matrix, which arises from a coupled 5-point stencil pattern 
on a vector field. Taking advantage of this structure, we can achieve 
significantly better performance than any library routine by creat- 
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Figure 5. Left: closeup of video frame. Right: gradient vector field 
for optical flow 
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def 

@cu 
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@cu 
def 



vadd(x, y) : 

return map (lambda a, b: return a + b, x, y) 

vmul (x, y) : 

return map (lambda a, b: return a * b, x, y) 

form_preconditioner(a, b, c): 
def det_inverse(ai, bi, ci): 

return 1.0/(ai * ci - bi * bi) 
indets = map(det_inverse, a, b, c) 
p_a = vmul(indets, c) 

p_b = map(lambda a, b: -a * b, indets, b) 
p_c = vmul(indets, a) 
return p_a, p_b, p_c 

precondition^, v, p_a, p_b, p_c) : 

e = vadd(vmul(p_a, u), vmul(p_b, v)) 

f = vadd(vmul(p_b, u), vmul(p_c, v)) 
return e, f 



Figure 6. Forming and applying the block Jacobi preconditioner 



ing a custom matrix format and sparse matrix- vector multiplication 
routine. This is an ideal scenario for Copperhead, since the produc- 
tivity gains we provide make it more feasible for people to write 
custom algorithms. 

In addition to writing a custom sparse-matrix vector multipli- 
cation routine, practically solving this problem requires the use of 
a preconditioner, since without preconditioning, convergence is or- 
ders of magnitude slower. We utilize a block Jacobi preconditioner. 
Figure 6 shows the Copperhead code for computing the precondi- 
tioner, which involves inverting a set of symmetric 2x2 matrices, 
with one matrix for each point in the vector field, as well as apply- 
ing the preconditioner, which involes a large number of symmetric 
2x2 matrix multiplicaitons. 

We implemented the entire solver in Copperhead. The custom 
SpMV routine for this matrix runs within 10% of the hand-coded 
CUDA version, achieving 49 SP GFLOP/s on a GTX 480, whereas 
a hand- tuned CUDA version achieves 55 SP GFLOP/s on the same 
hardware. Overall, for the complete preconditioned conjugate gra- 
dient solver, the Copperhead generated code yields 7 1 % of the cus- 
tom CUDA implementation. 

Figure 5 was generated using thematplotlib Python library, 
operating directly from the Copperhead result of one of the steps 
in the solver, highlighting the ability of Copperhead programs to 
interoperate with other Python modules. 
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Figure 7. Performance on Preconditioned Conjugate Gradient 
Solver 



6.3 Quadratic Programming: Nonlinear Support Vector 
Machine Training 

Support Vector Machines are a widely used classification technique 
from machine learning. Support Vector Machines classify multi- 
dimensional data by checking where the data lies with respect to 
a decision surface. Nonlinear decision surfaces are learned via a 
Quadratic Programming optimization problem, which we imple- 
ment in Copperhead. 

We make use of the Sequential Minimal Optimization algo- 
rithm, with first-order variable selection heuristic [17], coupled 
with the RBF kernel, which is commonly used for nonlinear SVMs. 
Computationally, this is an iterative algorithm, where at each step 
the majority of the work is to evaluate distance functions between 
all data points and two active data points, update an auxiliary vec- 
tor and then to select the extrema from data-dependent subsets of 
this vector. More details can be found in the literature [5]. Space 
does not permit us to include the Copperhead code that implements 
this solver, but we do wish to point out the RBF kernel evaluation 
code, shown in Figure 9, is used both on its own, where the com- 
piler creates a version that instantiates the computation across the 
entire machine, as well as nested as an inner computation within a 
larger parallel context, where the compiler fuses it into a sequential 
loop. 

Manual Copperhead 
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Figure 8. Support Vector Machine Training Performance 

We compare performance against GPUS VM, a publically avail- 
able CUDA library for SVM training. Figure 8 shows the through- 
put of the Copperhead implementation of SVM training over four 
datasets. On average, the Copperhead implementation attains 51% 
of hand-optimized, well tuned CUDA performance, which is quite 
respectable. The main advantage of the hand coded CUDA imple- 
mentation on this example compared to the Copperhead compiled 
code is the use of on-chip memories to reduce memory traffic, and 
we expect that as the Copperhead compiler matures, we will be able 
to perform this optimization in Copperhead as well, thanks to our 
existing shape analysis, which can inform data placement. 



@cu 

def norm2_dif f (x, y) : 
def el (xi, yi) : 

diff = xi - yi 

return diff * diff 
return sum (map (el, x, y)) 

@cu 

def rbf (ngamma, x, y) : 

return exp(ngamma * norm2_dif f (x, y) ) 



Figure 9. RBF Kernel evaluation 



6.4 Productivity 

Productivity is difficult to measure, but as a rough approximation, 
Table 2 provides the number of lines of code needed to imple- 
ment the core functionality of the examples we have put forth, 
in C++/CUDA as well as in Python/Copperhead. Additionally, Ta- 
ble 2 shows the number of lines of code which were generated by 
the Copperhead compiler. On average, the Copperhead programs 
take about 4 times fewer lines of code than their C++ equivalents, 
which suggests that Copperhead programming is indeed more pro- 
ductive than the manual alternatives. 



Example 


CUDA & 
C++ 


Copperhead & 
Python 


Generated by 
Copperhead 


Scalar CSR 


16 


6 


90 


Vector CSR 


39 


6 


90 


ELL 


22 


4 


135 


PCG Solver 


172 


79 


789 


SVM Training 


429 


111 


431 



Table 2. Number of Lines of Code for Example Programs 



7. Future Work 

There are many avenues for improving the Copperhead runtime 
and compiler. The first set involves improving the quality of code 
which the compiler emits. For example, the compiler currently 
does not reorder parallel primitives to increase fusion and reduce 
synchronization. The compiler also does not attempt to analyze 
data reuse and place small, yet heavily reused sequences in on- 
chip memories to reduce memory bandwidth, nor does it attempt 
to take advantage of parallelism in the parallel primitive dataflow 
graph. Another avenue for future work involves broadening the 
scope of programs which can be compiled by the Copperhead 
compiler, such as supporting multi-dimensional arrays, adding new 
data parallel primitives, and supporting forms of recursion other 
than tail recursion. 

We also intend to create other back-ends for the Copperhead 
compiler, in order to support other platforms besides CUDA. Some 
obvious choices include support for multicore x86 processors and 
OpenCL platforms. 

8. Conclusion 

This paper has shown how to efficiently compile a nested data par- 
allel language to modern parallel microprocessors. Instead of using 
flattening transforms, we take advantage of nested parallel hard- 
ware by mapping data parallel computations directly onto the hard- 
ware parallelism hierarchy. This mapping is enabled by synchro- 
nization and shape analyses, which which we introduce. Instead of 
creating a new data parallel language, we have embedded Copper- 
head into a widely used productivity language, which makes Cop- 
perhead programs look, feel and operate like Python programs, in- 
teroperating with the vast array of Python libraries, and lowering 



the amount of investment it takes for a programmer to learn data 
parallel programming. 

We then demonstrated that our compiler generates efficient 
code, yielding from 45 to 100% of the performance of hand crafted, 
well-optimized CUDA code, but with much higher programmer 
productivity. On average, Copperhead programs require 3.6 times 
fewer lines of code than CUDA programs, comparing only the core 
computational portions. 

Large communities of programmers choose to use productiv- 
ity languages such as Python, because programmer productivity 
is often more important than attaining absolute maximum perfor- 
mance. We believe Copperhead occupies a worthwhile position in 
the tradeoff between productivity and performance, providing per- 
formance comparable to that of hand-optimized code, but with a 
fraction of the programmer effort. 
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